{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Clustering Documentation Example\n",
    "\n",
    "<h2 id=\"k-means\">K-means</h2>\n",
    "\n",
    "<p><a href=\"http://en.wikipedia.org/wiki/K-means_clustering\">k-means</a> is one of the\n",
    "most commonly used clustering algorithms that clusters the data points into a\n",
    "predefined number of clusters. The MLlib implementation includes a parallelized\n",
    "variant of the <a href=\"http://en.wikipedia.org/wiki/K-means%2B%2B\">k-means++</a> method\n",
    "called <a href=\"http://theory.stanford.edu/~sergei/papers/vldb12-kmpar.pdf\">kmeans||</a>.</p>\n",
    "\n",
    "<p><code>KMeans</code> is implemented as an <code>Estimator</code> and generates a <code>KMeansModel</code> as the base model.</p>\n",
    "\n",
    "<h3 id=\"input-columns\">Input Columns</h3>\n",
    "\n",
    "<table class=\"table\">\n",
    "  <thead>\n",
    "    <tr>\n",
    "      <th align=\"left\">Param name</th>\n",
    "      <th align=\"left\">Type(s)</th>\n",
    "      <th align=\"left\">Default</th>\n",
    "      <th align=\"left\">Description</th>\n",
    "    </tr>\n",
    "  </thead>\n",
    "  <tbody>\n",
    "    <tr>\n",
    "      <td>featuresCol</td>\n",
    "      <td>Vector</td>\n",
    "      <td>\"features\"</td>\n",
    "      <td>Feature vector</td>\n",
    "    </tr>\n",
    "  </tbody>\n",
    "</table>\n",
    "\n",
    "<h3 id=\"output-columns\">Output Columns</h3>\n",
    "\n",
    "<table class=\"table\">\n",
    "  <thead>\n",
    "    <tr>\n",
    "      <th align=\"left\">Param name</th>\n",
    "      <th align=\"left\">Type(s)</th>\n",
    "      <th align=\"left\">Default</th>\n",
    "      <th align=\"left\">Description</th>\n",
    "    </tr>\n",
    "  </thead>\n",
    "  <tbody>\n",
    "    <tr>\n",
    "      <td>predictionCol</td>\n",
    "      <td>Int</td>\n",
    "      <td>\"prediction\"</td>\n",
    "      <td>Predicted cluster center</td>\n",
    "    </tr>\n",
    "  </tbody>\n",
    "</table>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "#Cluster methods Example\n",
    "from pyspark.sql import SparkSession\n",
    "spark = SparkSession.builder.appName('cluster').getOrCreate()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Within Set Sum of Squared Errors = 0.11999999999994547\n",
      "Cluster Centers: \n",
      "[ 9.1  9.1  9.1]\n",
      "[ 0.1  0.1  0.1]\n"
     ]
    }
   ],
   "source": [
    "from pyspark.ml.clustering import KMeans\n",
    "\n",
    "# Loads data.\n",
    "dataset = spark.read.format(\"libsvm\").load(\"sample_kmeans_data.txt\")\n",
    "\n",
    "# Trains a k-means model.\n",
    "kmeans = KMeans().setK(2).setSeed(1)\n",
    "model = kmeans.fit(dataset)\n",
    "\n",
    "# Evaluate clustering by computing Within Set Sum of Squared Errors.\n",
    "wssse = model.computeCost(dataset)\n",
    "print(\"Within Set Sum of Squared Errors = \" + str(wssse))\n",
    "\n",
    "# Shows the result.\n",
    "centers = model.clusterCenters()\n",
    "print(\"Cluster Centers: \")\n",
    "for center in centers:\n",
    "    print(center)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Alright let's code through our own example!"
   ]
  }
 ],
 "metadata": {
  "anaconda-cloud": {},
  "kernelspec": {
   "display_name": "Python [conda root]",
   "language": "python",
   "name": "conda-root-py"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.5.3"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 0
}
